We show all of these in an embellished example Snakefile called Snakefile2 in the Snakemake-Example directory. This Snakefile processes all 16 Chinook salmon samples at 4 chromosomes from our example data set.
Our Chinook Example: GATK Best Practices “Light”
```mermaid
flowchart TD
  A(fastq files from 16 samples: our raw data) --> B(Trim the reads: fastp)
  B --> C(Map the reads to a reference genome: bwa-mem2)
  C --> D(Mark PCR and optical duplicates: gatk MarkDuplicates)
  D --> E(Make gVCF files for each sample/chromo: gatk HaplotypeCaller)
  E --> F(Load gVCFs into Genomic DB for each chromo: gatk GenomicsDBImport)
  F --> G(Create VCFs from Genomic DB for each chromo: gatk GenotypeGVCFs)
  G --> H(Concatenate chromosome-vcfs into a single vcf: bcftools)
```
A mini data set that only takes about 25 minutes to run through the major steps of a GATK-like variant calling workflow
Chinook salmon sequencing reads (a subset of our course example data).
Three pairs of paired-end fastq files from samples A, B, and C, with data from only four chromosomes.
We will trim it, map it, mark duplicates, then make one gVCF file for each combination of individual and chromosome (only two chromosomes).
Then, call variants on each of two chromosomes.
Then concatenate the resulting VCFs into a single VCF file.
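To get a feel for how the per-sample and per-chromosome steps multiply into jobs, here is a small Python sketch that mimics what Snakemake's `expand()` does. The sample and chromosome names below are invented for illustration; the real workflow derives them from the config and sample table.

```python
from itertools import product

# hypothetical sample and chromosome lists; the real workflow reads
# 16 samples from sample-info.tsv and 4 chromosomes from config.yaml
samples = [f"s{i:03d}" for i in range(1, 17)]   # 16 samples
chromos = ["chr1", "chr2", "chr3", "chr4"]      # 4 chromosomes

# one gVCF per (sample, chromosome) pair, analogous to
# expand("results/gvcf/{chromo}/{sample}.g.vcf.gz", ...)
gvcfs = [f"results/gvcf/{c}/{s}.g.vcf.gz" for s, c in product(samples, chromos)]
print(len(gvcfs))  # 16 samples x 4 chromosomes = 64 gVCF jobs
```

This is why the workflow submits so many jobs: every per-sample-per-chromosome rule multiplies the job count.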
Setting up our workspaces
Sync your fork’s main branch then pull updates into the main branch of your local clone.
We will just be working on a login node, since we will be submitting jobs via Snakemake to SLURM.
Note, to allow Snakemake to keep running on the login node after disconnecting you need to be using tmux or screen (or maybe run it with nohup).
You should already have a snakemake environment set up, according to the directions here.
Activate that environment.
We do all of today's activities from the top level of the repo.
```sh
# activate env
conda activate snakemake-8.5.3

# To make sure snakemake is working, print the help information
# for snakemake
snakemake --help

# make sure that you are in the top level of the
# con-gen-csu repo:
pwd

# In case we don't have conda environments built here for the
# workflow, do this in the start of class and let it run
snakemake --conda-create-envs-only -s Snakemake-Example/Snakefile2
```
Note: If you are using a version of Snakemake < 8.0 (like some of the folks at NMFS) things should all work out, but the actual profiles are a little different.
Updated Snakefile and associated files
We can use the Unix tree utility to see what the Snakemake-Example directory contains.
Within the Snakemake-Example directory, type tree at the command line. The new files here are:
Snakefile2. Much more about that later.
A new file named config.yaml
A new file called sample-info.tsv
A directory slurm with some profiles
The smk directory has profiles using Snakemake's officially supported SLURM executor.
The jdb directory has profiles for running SLURM with John D. Blischak’s simple-slurm-smk approach.
If I need to cancel the run, <cntrl>-c once will tell Snakemake to start killing all the jobs using scancel. Be patient and let it do its work.
When we start it back up, Snakemake knows which files need to be regenerated, even if they were already partially made.
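Conceptually, Snakemake's default rerun decision is a timestamp comparison: an output must exist and be at least as new as all of its inputs. Here is a minimal sketch of that idea (this is not Snakemake's actual code; the real logic also tracks parameter, code, and input-set changes):

```python
import os

def needs_rerun(inputs, output):
    """Return True if output is missing or older than any input.

    Mimics the spirit of mtime-based rerun triggering (compare with
    'rerun-trigger: mtime' in the profile); Snakemake's real logic
    is richer, but the core staleness check looks like this.
    """
    if not os.path.exists(output):
        return True
    out_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(f) > out_mtime for f in inputs)
```

So after an interrupted run, finished outputs that are newer than their inputs are left alone, and only missing or stale targets are scheduled again.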
What is this profile?
A profile is a directory that contains a file config.yaml inside it, and possibly other files.
The config.yaml could instead be named config.v8+.yaml, which is used preferentially by Snakemake versions >= 8.0.
This yaml file stores command line options:
--option arg becomes option: arg
Options with no arguments use: option: True
Snakemake then behaves as if these options had been typed on the command line.
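The mapping is mechanical, so you can think of a profile as a serialized command line. A hypothetical illustration in Python (this is not how Snakemake parses profiles internally, just the rule of thumb made concrete):

```python
def profile_to_argv(profile: dict) -> list[str]:
    """Turn profile entries back into the command-line flags they stand for."""
    argv = []
    for option, arg in profile.items():
        if arg is True:
            # 'option: True' stands for a bare flag with no argument
            argv.append(f"--{option}")
        else:
            # 'option: arg' stands for '--option arg'
            argv.extend([f"--{option}", str(arg)])
    return argv

print(profile_to_argv({"jobs": 950, "keep-going": True}))
# ['--jobs', '950', '--keep-going']
```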
Contents of Snakemake-Example/slurm/jdb/alpine/config.v8+.yaml
```yaml
executor: cluster-generic
cluster-generic-submit-cmd: mkdir -p results/slurm_logs/{rule} && sbatch --partition=amilan,csu --cpus-per-task={threads} --mem={resources.mem_mb} --time={resources.time} --job-name=smk-{rule}-{wildcards} --output=results/slurm_logs/{rule}/{rule}-%j-{wildcards}.out --error=results/slurm_logs/{rule}/{rule}-%j-{wildcards}.err --parsable
cluster-generic-status-cmd: status-sacct-robust.sh
cluster-generic-cancel-cmd: scancel
cluster-generic-cancel-nargs: 400
# warning, time here is very small. It works for the small
# example data set, but should be increased for production jobs
default-resources:
  - time="00:30:00"
  - mem_mb=3740
  - tmpdir="results/snake-tmp"
restart-times: 0
max-jobs-per-second: 10
max-status-checks-per-second: 25
local-cores: 1
latency-wait: 60
cores: 2400
jobs: 950
keep-going: True
rerun-incomplete: True
printshellcmds: True
software-deployment-method: conda
rerun-trigger: mtime
```
The “SLURM” parts of the profile
executor: use the generic cluster snakemake executor plugin.
cluster-generic-submit-cmd: use this command to submit jobs to the cluster. (Note the {resources.time}, etc.)
cluster-generic-status-cmd: command snakemake can use to check status of jobs
cluster-generic-cancel-cmd: command to use to kill running jobs if we terminate Snakemake (i.e., do <cntrl>-c).
default-resources: Unless otherwise noted (in the rule) use these as the compute resources for each job. Note the memory here is customized for Alpine. (The SEDNA profile is customized for SEDNA).
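The {threads}, {resources.mem_mb}, and similar placeholders in the submit command are filled in per job from the rule's settings (or the defaults). A rough Python sketch of that substitution, with names flattened for illustration (the real template uses attribute-style placeholders like {resources.mem_mb}):

```python
# a trimmed-down stand-in for the submit command in the profile
template = (
    "sbatch --cpus-per-task={threads} --mem={mem_mb} "
    "--time={time} --job-name=smk-{rule}"
)

# per-job values; defaults come from 'default-resources' unless the
# rule overrides them (as map_reads does with mem_mb and time)
job = {"rule": "map_reads", "threads": 2, "mem_mb": 7480, "time": "01:00:00"}

cmd = template.format(**job)
print(cmd)
# sbatch --cpus-per-task=2 --mem=7480 --time=01:00:00 --job-name=smk-map_reads
```

Each job Snakemake submits is just one such formatted sbatch call.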
JDB’s cluster status command
The standard Snakemake profile for SLURM uses a python script to check cluster status, and this seems to take much more time to invoke than just running JDB’s shell script.
Main purpose: return running, success, or failed for any SLURM job number.
Much of the code here is in place to give SLURM many chances to tell us how the job is doing. (In case SLURM is busy-ish…)
Most likely you will never need to edit this, but you should make sure it is executable if you add it to an updated profile.
Contents of Snakemake-Example/slurm/jdb/alpine/status-sacct-robust.sh
```sh
#!/usr/bin/env bash

# Check status of Slurm job. More robust because runs `sacct` multiple times if
# needed

jobid="$1"

if [[ "$jobid" == Submitted ]]
then
  echo smk-simple-slurm: Invalid job ID: "$jobid" >&2
  echo smk-simple-slurm: Did you remember to add the flag --parsable to your sbatch call? >&2
  exit 1
fi

function get_status(){
  sacct -j "$1" --format State --noheader | head -n 1 | awk '{print $1}'
}

for i in {1..20}
do
  output=`get_status "$jobid"`
  if [[ ! -z $output ]]
  then
    break
  else
    sleep 3
  fi
done

if [[ -z $output ]]
then
  echo sacct failed to return the status for jobid "$jobid" >&2
  echo Maybe you need to use scontrol instead? >&2
  exit 1
fi

if [[ $output =~ ^(COMPLETED).* ]]
then
  echo success
elif [[ $output =~ ^(RUNNING|PENDING|COMPLETING|CONFIGURING|SUSPENDED).* ]]
then
  echo running
else
  echo failed
fi
```
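The case analysis at the end of the script boils down to a small state-classification function. Here is that same logic sketched in Python, just to make the mapping explicit (the shell script above is what actually runs):

```python
# sacct states that mean the job is still in flight
RUNNING_STATES = ("RUNNING", "PENDING", "COMPLETING", "CONFIGURING", "SUSPENDED")

def classify(sacct_state: str) -> str:
    """Map an sacct State string to what Snakemake expects to hear.

    Mirrors the branching in status-sacct-robust.sh: COMPLETED is
    success, in-flight states are running, everything else
    (CANCELLED, FAILED, TIMEOUT, empty, ...) counts as failed.
    """
    state = (sacct_state.split() or [""])[0]
    if state.startswith("COMPLETED"):
        return "success"
    if any(state.startswith(s) for s in RUNNING_STATES):
        return "running"
    return "failed"
```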
Now, you try it on Alpine (or SEDNA)
This will submit a lot of jobs to the cluster (and we will see if that is a problem, especially if Snakemake is checking job statuses often).
Do this on a login node.
On Snakemake version >= 8.0 you have to install the Snakemake executor plugin for a generic cluster. Don't do this for Snakemake < 8.0:
```sh
# your snakemake environment must be activated for this to work
pip install snakemake-executor-plugin-cluster-generic
```
On Alpine
```sh
# do a dry run:
snakemake -np -s Snakemake-Example/Snakefile2 --profile Snakemake-Example/slurm/jdb/alpine

# do a real run:
snakemake -p -s Snakemake-Example/Snakefile2 --profile Snakemake-Example/slurm/jdb/alpine
```
On SEDNA
```sh
# do a dry run:
snakemake -np -s Snakemake-Example/Snakefile2 --profile Snakemake-Example/slurm/jdb/sedna

# do a real run:
snakemake -p -s Snakemake-Example/Snakefile2 --profile Snakemake-Example/slurm/jdb/sedna
```
When the job is running
Check your queued and running jobs with myjobs or myjobs 40 to see 40 characters of the job name.
See that the job names actually make sense. (This is not the case with the official Snakemake slurm executor)
Check the slurm logs that are written in results/slurm_logs with reasonable names.
Eric shows the official snakemake slurm executor
This would only work on Version >= 8.0
I will run this on SEDNA:
```sh
# before running this, it is necessary to do this (once) while
# the snakemake environment is active:
pip install snakemake-executor-plugin-slurm

# now, I call it with the profile that uses the official
# snakemake slurm plugin.

# Dry run:
snakemake -np -s Snakemake-Example/Snakefile2 --profile Snakemake-Example/slurm/smk/sedna

# Real Run:
snakemake -p -s Snakemake-Example/Snakefile2 --profile Snakemake-Example/slurm/smk/sedna
```
Check it with myjobs 50. Wow! Those are some Fugly and useless job names:
```
JOBID  PARTITION NAME                                 USER      ST TIME NODES NODELIST(REASON) CPUS MIN_MEMORY TIME_LIMIT TIME_LEFT PRIORITY
867355 standard  f8b21817-fa1d-4a40-905d-2e9028d14c03 eanderson PD 0:00 1     (Priority)       1    4800M      8:00:00    8:00:00   0.99990073288789
867356 standard  f8b21817-fa1d-4a40-905d-2e9028d14c03 eanderson PD 0:00 1     (Priority)       1    4800M      8:00:00    8:00:00   0.99990073265506
867357 standard  f8b21817-fa1d-4a40-905d-2e9028d14c03 eanderson PD 0:00 1     (Priority)       1    4800M      8:00:00    8:00:00   0.99990073242222
867358 standard  f8b21817-fa1d-4a40-905d-2e9028d14c03 eanderson PD 0:00 1     (Priority)       1    4800M      8:00:00    8:00:00   0.99990073218939
867359 standard  f8b21817-fa1d-4a40-905d-2e9028d14c03 eanderson PD 0:00 1     (Priority)       1    4800M      8:00:00    8:00:00   0.99990073195656
```
A First High-level View of Snakefile2
Contents of Snakemake-Example/Snakefile2
```python
# start with some stuff to get the Snakemake version
from snakemake import __version__ as snakemake_version
smk_major_version = int(snakemake_version[0])

# import modules as needed. Also import the snakemake
# internal command to load a config file into a dict
import pandas as pd
if smk_major_version >= 8:
    from snakemake.common.configfile import _load_configfile
else:
    from snakemake.io import _load_configfile

# define rules that don't need to be run on a compute node.
# i.e. those that can be run locally.
localrules: all, genome_faidx, genome_dict

### Get a dict named config from Snakemake-Example/config.yaml
configfile: "Snakemake-Example/config.yaml"

### Get the sample info table read into a pandas data frame
sample_table = pd.read_table(config["sample_info"], dtype="str").set_index("sample", drop=False)

### Transfer values from the yaml and tabular config to
### our familiar lists, SAMPLES and CHROMOS

# Populate our SAMPLES list from the sample_table using a little
# pandas syntax
SAMPLES = sample_table["sample"].unique().tolist()

# Define CHROMOS from the values in the config file
CHROMOS = config["chromos"]

### Input Functions that use the tabular sample_info

# define a function to get the fastq path from the sample_table. This
# returns it as a dict, so we need to unpack it in the rule
def get_fastqs(wildcards):
    fq1 = sample_table.loc[wildcards.sample, "fq1"]
    fq2 = sample_table.loc[wildcards.sample, "fq2"]
    return {"r1": fq1, "r2": fq2}

# define a function for getting the read group information
# from the sample table for each particular sample (according
# to the wildcard value)
def get_read_group(wildcards):
    """Denote sample name and platform in read group."""
    return r"-R '@RG\tID:{sample}_{sample_id}_{library}_{flowcell}_{lane}_{barcode}\tSM:{sample_id}\tPL:{platform}\tLB:{library}\tPU:{flowcell}.{lane}.{barcode}'".format(
        sample=wildcards.sample,
        sample_id=sample_table.loc[(wildcards.sample), "sample_id"],
        platform=sample_table.loc[(wildcards.sample), "platform"],
        library=sample_table.loc[(wildcards.sample), "library"],
        flowcell=sample_table.loc[(wildcards.sample), "flowcell"],
        lane=sample_table.loc[(wildcards.sample), "lane"],
        barcode=sample_table.loc[(wildcards.sample), "barcode"],
    )

### Specify rule "all"
# By default, Snakemake tries to create the input files needed
# for the first rule in the Snakefile, so we define the first
# rule to ask for results/vcf/all.vcf.gz
rule all:
    input:
        "results/vcf/all.vcf.gz"

rule genome_faidx:
    input:
        "data/genome/genome.fasta",
    output:
        "data/genome/genome.fasta.fai",
    conda:
        "envs/bwa2sam.yaml"
    log:
        "results/logs/genome_faidx.log",
    shell:
        "samtools faidx {input} 2> {log} "

rule genome_dict:
    input:
        "data/genome/genome.fasta",
    output:
        "data/genome/genome.dict",
    conda:
        "envs/bwa2sam.yaml"
    log:
        "results/logs/genome_dict.log",
    shell:
        "samtools dict {input} > {output} 2> {log} "

rule bwa_index:
    input:
        "data/genome/genome.fasta"
    output:
        multiext("data/genome/genome.fasta", ".0123", ".amb", ".ann", ".bwt.2bit.64", ".pac"),
    conda:
        "envs/bwa2sam.yaml"
    log:
        out="results/logs/bwa_index/bwa_index.log",
        err="results/logs/bwa_index/bwa_index.err"
    shell:
        "bwa-mem2 index {input} > {log.out} 2> {log.err} "

rule trim_reads:
    input:
        unpack(get_fastqs)  # unpack creates named inputs from the dict that
                            # get_fastqs returns
    output:
        r1="results/trimmed/{sample}_R1.fastq.gz",
        r2="results/trimmed/{sample}_R2.fastq.gz",
        html="results/qc/fastp/{sample}.html",
        json="results/qc/fastp/{sample}.json"
    conda:
        "envs/fastp.yaml"
    log:
        out="results/logs/trim_reads/{sample}.log",
        err="results/logs/trim_reads/{sample}.err",
    params:
        as1=config["params"]["fastp"]["adapter_sequence1"],
        as2=config["params"]["fastp"]["adapter_sequence2"],
        parm=config["params"]["fastp"]["other_options"]
    shell:
        " fastp -i {input.r1} -I {input.r2} "
        " -o {output.r1} -O {output.r2} "
        " -h {output.html} -j {output.json} "
        " --adapter_sequence={params.as1} "
        " --adapter_sequence_r2={params.as2} "
        " {params.parm} > {log.out} 2> {log.err} "

rule map_reads:
    input:
        r1="results/trimmed/{sample}_R1.fastq.gz",
        r2="results/trimmed/{sample}_R2.fastq.gz",
        genome="data/genome/genome.fasta",
        idx=multiext("data/genome/genome.fasta", ".0123", ".amb", ".ann", ".bwt.2bit.64", ".pac")
    output:
        "results/bam/{sample}.bam"
    conda:
        "envs/bwa2sam.yaml"
    log:
        "results/logs/map_reads/{sample}.log"
    threads: 2
    resources:
        mem_mb=7480,
        time="01:00:00"
    params:
        RG=get_read_group
    shell:
        " (bwa-mem2 mem -t {threads} {params.RG} {input.genome} {input.r1} {input.r2} | "
        " samtools view -u | "
        " samtools sort - > {output}) 2> {log} "

rule mark_duplicates:
    input:
        "results/bam/{sample}.bam"
    output:
        bam="results/mkdup/{sample}.bam",
        bai="results/mkdup/{sample}.bai",
        metrics="results/qc/mkdup_metrics/{sample}.metrics"
    conda:
        "envs/gatk.yaml"
    log:
        "results/logs/mark_duplicates/{sample}.log"
    shell:
        " gatk MarkDuplicates "
        " --CREATE_INDEX "
        " -I {input} "
        " -O {output.bam} "
        " -M {output.metrics} > {log} 2>&1 "

rule make_gvcfs_by_chromo:
    input:
        bam="results/mkdup/{sample}.bam",
        bai="results/mkdup/{sample}.bai",
        ref="data/genome/genome.fasta",
        idx="data/genome/genome.dict",
        fai="data/genome/genome.fasta.fai"
    output:
        gvcf="results/gvcf/{chromo}/{sample}.g.vcf.gz",
        idx="results/gvcf/{chromo}/{sample}.g.vcf.gz.tbi",
    conda:
        "envs/gatk.yaml"
    log:
        "results/logs/make_gvcfs_by_chromo/{chromo}/{sample}.log"
    params:
        java_opts="-Xmx4g",
        hmt=config["params"]["gatk"]["HaplotypeCaller"]["hmm_threads"]
    shell:
        " gatk --java-options \"{params.java_opts}\" HaplotypeCaller "
        " -R {input.ref} "
        " -I {input.bam} "
        " -O {output.gvcf} "
        " -L {wildcards.chromo} "
        " --native-pair-hmm-threads {params.hmt} "
        " -ERC GVCF > {log} 2> {log} "

rule import_genomics_db_by_chromo:
    input:
        gvcfs=expand("results/gvcf/{{chromo}}/{s}.g.vcf.gz", s=SAMPLES)
    output:
        gdb=directory("results/genomics_db/{chromo}")
    conda:
        "envs/gatk.yaml"
    log:
        "results/logs/import_genomics_db_by_chromo/{chromo}.log"
    params:
        java_opts="-Xmx4g"
    shell:
        " VS=$(for i in {input.gvcfs}; do echo -V $i; done); "  # make a string like -V file1 -V file2
        " gatk --java-options \"-Xmx4g\" GenomicsDBImport "
        " $VS "
        " --genomicsdb-workspace-path {output.gdb} "
        " -L {wildcards.chromo} 2> {log} "

rule vcf_from_gdb_by_chromo:
    input:
        gdb="results/genomics_db/{chromo}",
        ref="data/genome/genome.fasta",
        fai="data/genome/genome.fasta.fai",
        idx="data/genome/genome.dict",
    output:
        vcf="results/chromo_vcfs/{chromo}.vcf.gz",
        idx="results/chromo_vcfs/{chromo}.vcf.gz.tbi",
    conda:
        "envs/gatk.yaml"
    log:
        "results/logs/vcf_from_gdb_by_chromo/{chromo}.txt"
    shell:
        " gatk --java-options \"-Xmx4g\" GenotypeGVCFs "
        " -R {input.ref} "
        " -V gendb://{input.gdb} "
        " -O {output.vcf} 2> {log} "

rule concat_vcfs:
    input:
        vcfs=expand("results/chromo_vcfs/{c}.vcf.gz", c=CHROMOS)
    output:
        vcf="results/vcf/all.vcf.gz"
    conda:
        "envs/bcftools.yaml"
    log:
        "results/concat_vcfs/all.log"
    shell:
        "bcftools concat -n {input.vcfs} > {output.vcf} 2> {log} "
```
```python
### Get a dict named config from Snakemake-Example/config.yaml
configfile: "Snakemake-Example/config.yaml"
```
This loads values from a YAML file into a dict variable named config.
Note that this config.yaml is completely different from the config.yaml in the SLURM profile we just looked at.
What does this config.yaml have in it?
Contents of Snakemake-Example/config.yaml
```yaml
# path to the genome (not used in workflow, but we could have...)
genome_path: "data/genome/genome.fasta"

# path to file with information about samples
sample_info: "Snakemake-Example/sample-info.tsv"

# Put the list of chromosomes we want to do here
chromos:
  - "NC_037122.1f5t9"
  - "NC_037123.1f10t14"
  - "NC_037124.1f8t16"
  - "NC_037125.1f20t24"

# parameters to be used for different rules/programs
params:
  fastp:
    adapter_sequence1: "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
    adapter_sequence2: "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
    other_options:
      - "--detect_adapter_for_pe"
      - "--cut_right"
      - "--cut_right_window_size 4"
      - "--cut_right_mean_quality 20"
  gatk:
    HaplotypeCaller:
      hmm_threads: 1
```
Config file variables used in Snakefile
Get the chromosome names out of the config:
```python
# Define CHROMOS from the values in the config file
CHROMOS = config["chromos"]
```
You can grab values from the config dict variable in the input, output, or params, etc. blocks.
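Because config is just a nested Python dict, pulling a value out is ordinary dict indexing. Here is a hand-built dict with the same shape the configfile directive would produce from config.yaml (abbreviated), and the kinds of lookups the Snakefile does:

```python
# abbreviated stand-in for what 'configfile:' loads from config.yaml
config = {
    "sample_info": "Snakemake-Example/sample-info.tsv",
    "chromos": ["NC_037122.1f5t9", "NC_037123.1f10t14"],
    "params": {
        "fastp": {"other_options": ["--detect_adapter_for_pe", "--cut_right"]},
        "gatk": {"HaplotypeCaller": {"hmm_threads": 1}},
    },
}

# top-level lookup, as done when defining CHROMOS
CHROMOS = config["chromos"]

# nested lookup, as done in the make_gvcfs_by_chromo params block
hmt = config["params"]["gatk"]["HaplotypeCaller"]["hmm_threads"]
print(CHROMOS[0], hmt)  # NC_037122.1f5t9 1
```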
Tabular Configuration in a Snakefile:
In the config:
```yaml
# path to file with information about samples
sample_info: "Snakemake-Example/sample-info.tsv"
```
Note, we are using actual sample IDs from our lab for these instead of the DPCh_plate1_F10_S70 sorts of sample names.
But now, how do we find the right fastq files?
Read it in the snakefile:
```python
### Get the sample info table read into a pandas data frame
sample_table = pd.read_table(config["sample_info"], dtype="str").set_index("sample", drop=False)

### Transfer values from the yaml and tabular config to
### our familiar lists, SAMPLES and CHROMOS

# Populate our SAMPLES list from the sample_table using a little
# pandas syntax
SAMPLES = sample_table["sample"].unique().tolist()
```
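If you want to see the lookup logic without pandas, the same indexing-by-sample idea can be done with the standard library. The two rows below are invented for illustration (only a few of sample-info.tsv's columns are shown), but the shape of the lookup matches the Snakefile's get_fastqs input function:

```python
import csv
import io

# two made-up rows with a subset of the columns the Snakefile expects
tsv = io.StringIO(
    "sample\tfq1\tfq2\n"
    "T199967\tdata/fastq/T199967_R1.fq.gz\tdata/fastq/T199967_R2.fq.gz\n"
    "T199968\tdata/fastq/T199968_R1.fq.gz\tdata/fastq/T199968_R2.fq.gz\n"
)

# index the rows by the 'sample' column, like set_index("sample")
sample_table = {row["sample"]: row for row in csv.DictReader(tsv, delimiter="\t")}

def get_fastqs(sample):
    """Return the fastq paths for one sample, like the Snakefile's input function."""
    row = sample_table[sample]
    return {"r1": row["fq1"], "r2": row["fq2"]}

print(get_fastqs("T199967")["r1"])  # data/fastq/T199967_R1.fq.gz
```

This is all the tabular config is: a table keyed by sample name, so that wildcard values can be turned into file paths and read-group fields.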